{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0FiYdWV7Ar_I"
      },
      "source": [
        "Copyright 2021 Google LLC.\n",
        "\n",
        "SPDX-License-Identifier: Apache-2.0"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SxNpEqNOCboj"
      },
      "outputs": [],
      "source": [
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7tYrkvwgEJoZ"
      },
      "source": [
        "# Assessing the veracity of semantic markup for dataset pages\n",
        "\n",
        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/google-research/google-research/blob/master/dataset_or_not/dataset_or_not.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://github.com/google-research/google-research/tree/master/dataset_or_not/dataset_or_not.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "\u003c/table\u003e"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "O-FjzADgGR5j"
      },
      "source": [
        "## About this Colab\n",
        "This is a companion Colab for the paper \n",
        "\n",
        "[Dataset or Not? A study on the veracity of semantic markup for dataset pages]()\\\n",
        "*Tarfah Alrashed, Dimitris Paparas, Omar Benjelloun, Ying Sheng, and Natasha Noy*\n",
        "\n",
        "It contains python code for training the two main models from the paper, using the [Veracity of schema.org for datasets (labeled data)](https://www.kaggle.com/googleai/veracity-of-schemaorg-for-datasets-labeled-data) dataset.\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "Before continuing, download and unzip the [Veracity of schema.org for datasets (labeled data)](https://www.kaggle.com/googleai/veracity-of-schemaorg-for-datasets-labeled-data) dataset to your computer.\n",
        "\n",
        "## Note regarding *prominent terms*\n",
        "\n",
        "The released dataset and the code in this notebook do not contain the *prominent terms* feature mentioned in the paper. This is because that feature is extracted using proprietary code that cannot be released. The interested reader can replicate this feature extraction using the model proposed in [this paper](https://arxiv.org/abs/1805.01334)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KZwERv32GV4z"
      },
      "source": [
        "#Install required packages"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OhdmjXkkqGR7"
      },
      "outputs": [],
      "source": [
        "!pip install adanet\n",
        "!pip install --user --upgrade tensorflow-probability"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ITZ55s-P2AXL"
      },
      "source": [
        "#Import Modules"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kusq84g_p94T"
      },
      "outputs": [],
      "source": [
        "from google.colab import files\n",
        "import math\n",
        "import tensorflow.compat.v2 as tf\n",
        "import adanet\n",
        "import pandas as pd\n",
        "import io\n",
        "\n",
        "from tensorflow.keras.preprocessing import sequence\n",
        "from tensorflow.keras.preprocessing import text"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FN3TWLjg1v3r"
      },
      "source": [
        "# Upload Dataset\n",
        "\n",
        "Run the following cell and, when prompted, upload files *testing_set.csv*, *training_set.csv*, and *validation_set.csv* (that you downloaded as part of the prerequisites)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Y9iCyym11vQk"
      },
      "outputs": [],
      "source": [
        "uploaded = files.upload()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XfkASPTe3VNG"
      },
      "source": [
        "# Load dataset in pandas.DataFrame"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mk1ZmWOG3Z4F"
      },
      "outputs": [],
      "source": [
        "training_set = pd.read_csv('training_set.csv', keep_default_na=False)\n",
        "eval_set = pd.read_csv('validation_set.csv', keep_default_na=False)\n",
        "test_set = pd.read_csv('testing_set.csv', keep_default_na=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XZ9P01jZ_P2c"
      },
      "source": [
        "# Select Model\n",
        "\n",
        "In the next cell you can select which model to train. Remember to run the cell after making a selection. The features each model uses are:\n",
        "\n",
        "Column Name|Type|Contents|Lightweight Model|Full Model\n",
        "-----------|----|-|:-:|:--------:\n",
        "source_url| string |url of a webpage that contains schema.org/Dataset markup| |+\n",
        "name| string |The name of the dataset| +|+\n",
        "description| string |Description of the dataset|+|+\n",
        "has_distribution| bool|True if the dataset contains distribution metadata, false otherwise| |+\n",
        "has_encoding_or_file_format| bool |True if the dataset contains encoding or file format metadata, false otherwise| |+\n",
        "provider_or_publisher| string |The name of the provider or publisher of the dataset| |+\n",
        "author_or_creator| string |The author(s) or creator(s) of the dataset| |+\n",
        "doi| string|The Digital Object Identifier of the dataset| |+\n",
        "has_catalog| bool |True if the dataset is included in a data catalog, false otherwise| |+|\n",
        "has_dateCreated| bool |True if a creation date is provided, false otherwise| |+\n",
        "has_dateModified| bool |True if a modification date is provided, false otherwise| |+\n",
        "has_datePublished| bool |True if a publication date is provided, false otherwise| |+"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "iGEKT8lk_dQ9"
      },
      "outputs": [],
      "source": [
        "SELECTED_MODEL = 'lightweight_model'  #@param {type:'string'} [\"lightweight_model\", \"full_model\"]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "l_etikoRVa8w"
      },
      "source": [
        "# Preprocessing Parameters\n",
        "\n",
        "Dictionary with the sizes of the feature vocabularies to generate during preprocessing for each of the models"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "hgWaVoQEVqsK"
      },
      "outputs": [],
      "source": [
        "P_PARAMS_BY_MODEL = {\n",
        "    'lightweight_model': {\n",
        "        'vocab_size_by_feature': {\n",
        "            'description': 110211,\n",
        "            'name': 18720\n",
        "        },\n",
        "        'MAX_TOKENS': 400\n",
        "    },\n",
        "    'full_model': {\n",
        "        'vocab_size_by_feature': {\n",
        "            'description': 104383,\n",
        "            'name': 17495,\n",
        "            'author_or_creator': 1602,\n",
        "            'doi': 193,\n",
        "            'provider_or_publisher': 773,\n",
        "            'source_url': 17749\n",
        "        },\n",
        "        'MAX_TOKENS': 400\n",
        "    }\n",
        "}\n",
        "\n",
        "MODEL_P_PARAMS = P_PARAMS_BY_MODEL[SELECTED_MODEL]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r0N9Pf6555ge"
      },
      "source": [
        "# Data Preprocessing\n",
        "\n",
        "Analyze training dataset and generate tokenizers with custom vocabularies for each text feature"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "d-tHyTdn6Fy1"
      },
      "outputs": [],
      "source": [
        "tokenizers = {}\n",
        "\n",
        "for feature_name, vocab_size in MODEL_P_PARAMS['vocab_size_by_feature'].items():\n",
        "  tokenizers[feature_name] = text.Tokenizer(num_words=vocab_size)\n",
        "  tokenizers[feature_name].fit_on_texts(training_set[feature_name])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t5YK0Q1TAaZ3"
      },
      "source": [
        "# Hyperparametes\n",
        "\n",
        "Dictionary with the training hyperparameters for each of the models"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xW2NqYpmAZn5"
      },
      "outputs": [],
      "source": [
        "H_PARAMS_BY_MODEL = {\n",
        "    'lightweight_model': {\n",
        "        'features': ['description', 'name'],\n",
        "        'LEARNING_RATE': 0.00677,\n",
        "        'TRAIN_STEPS': 500,\n",
        "        'SHUFFLE_BUFFER_SIZE': 2048,\n",
        "        'BATCH_SIZE': 128,\n",
        "        'CLIP_NORM': 0.00037,\n",
        "        'HIDDEN_UNITS': [186],\n",
        "        'DROPOUT': 0.28673,\n",
        "        'ACTIVATION_FN': tf.nn.selu,\n",
        "        'MAX_ITERATION_STEPS': 333333,\n",
        "        'DO_BATCH_NORM': True,\n",
        "        'MAX_TRAIN_STEPS': 1000\n",
        "    },\n",
        "    'full_model': {\n",
        "        'features': [\n",
        "            'author_or_creator', 'description', 'doi', 'has_date_created',\n",
        "            'has_date_modified', 'has_date_published', 'has_distribution',\n",
        "            'has_encoding_or_file_format', 'name', 'provider_or_publisher',\n",
        "            'source_url'\n",
        "        ],\n",
        "        'LEARNING_RATE': 0.00076,\n",
        "        'TRAIN_STEPS': 500,\n",
        "        'SHUFFLE_BUFFER_SIZE': 2048,\n",
        "        'BATCH_SIZE': 128,\n",
        "        'CLIP_NORM': 0.25035,\n",
        "        'HIDDEN_UNITS': [329, 351, 292],\n",
        "        'DROPOUT': 0.08277,\n",
        "        'ACTIVATION_FN': tf.nn.selu,\n",
        "        'MAX_ITERATION_STEPS': 333333,\n",
        "        'DO_BATCH_NORM': False,\n",
        "        'MAX_TRAIN_STEPS': 1000\n",
        "    }\n",
        "}\n",
        "\n",
        "MODEL_H_PARAMS = H_PARAMS_BY_MODEL[SELECTED_MODEL]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vCtOrKZ41k91"
      },
      "source": [
        "# Utility functions\n",
        "\n",
        "Methods used to preprocess and create the input for training the model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NR0-SK5ddFmc"
      },
      "outputs": [],
      "source": [
        "def tokenize_and_pad(features):\n",
        "  \"\"\"Iterates over the features of a labeled sample, tokenizing and padding them.\n",
        "\n",
        "  Args:\n",
        "    features: A dictionary of feature values keyed by feature names. It includes\n",
        "      label as a feature\n",
        "\n",
        "  Returns:\n",
        "    A tuple with the processed features\n",
        "  \"\"\"\n",
        "\n",
        "  tokenized_features = list()\n",
        "  for feature in MODEL_H_PARAMS['features']:\n",
        "    # Tokenize text features according to the corresponding vocabulary\n",
        "    if feature in MODEL_P_PARAMS['vocab_size_by_feature']:\n",
        "      # Handle missing features\n",
        "      if not features[feature]:\n",
        "        tokenized = [[MODEL_P_PARAMS['vocab_size_by_feature'][feature]]]\n",
        "      else:\n",
        "        tokenized = tokenizers[feature].texts_to_sequences([features[feature]])\n",
        "      tokenized_features.append([\n",
        "          sequence.pad_sequences(\n",
        "              tokenized,\n",
        "              maxlen=MODEL_P_PARAMS['MAX_TOKENS'],\n",
        "              padding='post',\n",
        "              truncating='post')\n",
        "      ])\n",
        "    # Tokenize boolean features into binary values\n",
        "    else:\n",
        "      if features[feature]:\n",
        "        tokenized_features.append([1])\n",
        "      else:\n",
        "        tokenized_features.append([0])\n",
        "  tokenized_features.append(features['label'])\n",
        "  return tuple(tokenized_features)\n",
        "\n",
        "\n",
        "def generator(dataset):\n",
        "  \"\"\"Returns a generator mapping dataset entries to tokenized features-label pairs.\"\"\"\n",
        "\n",
        "  def _gen():\n",
        "    for entry in dataset.iterrows():\n",
        "      yield tokenize_and_pad(entry[1])\n",
        "\n",
        "  return _gen\n",
        "\n",
        "\n",
        "def preprocess(*args):\n",
        "  \"\"\"Tensorizes its arguments.\n",
        "\n",
        "  Args:\n",
        "    *args: Variable length arguments feature1, ..., featureK, label. Should be\n",
        "      in the same order as in MODEL_H_PARAMS['features']\n",
        "\n",
        "  Returns:\n",
        "    A pair of\n",
        "      1. A dictionary with the features keyed by their names\n",
        "      2. A label\n",
        "  \"\"\"\n",
        "  m = {}\n",
        "  for feature, name in zip(args[:-1], MODEL_H_PARAMS['features']):\n",
        "    m[name] = feature\n",
        "  return m, [args[-1]]\n",
        "\n",
        "\n",
        "def generate_output_types():\n",
        "  \"\"\"Returns a vector of output types corresponding to the tuple produced by the generator.\"\"\"\n",
        "  types = []\n",
        "  # Feature types\n",
        "  types = [tf.int32] * len(MODEL_H_PARAMS['features'])\n",
        "  # Label type\n",
        "  types.append(tf.bool)\n",
        "  return tuple(types)\n",
        "\n",
        "\n",
        "def input_fn(partition, training, batch_size):\n",
        "  \"\"\"Generates an input_fn for the Estimator.\n",
        "\n",
        "  Args:\n",
        "    partition: One of 'train', 'test', and 'eval' for training, testing, and\n",
        "      validation sets respectively\n",
        "    training: If true, then shuffle dataset to add randomness between epochs\n",
        "    batch_size: Number of elements to combine in a single batch\n",
        "\n",
        "  Returns:\n",
        "    The input function\n",
        "  \"\"\"\n",
        "\n",
        "  def _input_fn():\n",
        "    if partition == 'train':\n",
        "      dataset = tf.data.Dataset.from_generator(\n",
        "          generator(training_set), generate_output_types())\n",
        "    elif partition == 'test':\n",
        "      dataset = tf.data.Dataset.from_generator(\n",
        "          generator(test_set), generate_output_types())\n",
        "    elif partition == 'eval':\n",
        "      dataset = tf.data.Dataset.from_generator(\n",
        "          generator(eval_set), generate_output_types())\n",
        "    else:\n",
        "      print('Unknown partition')\n",
        "      return\n",
        "\n",
        "    if training:\n",
        "      dataset = dataset.shuffle(MODEL_H_PARAMS['SHUFFLE_BUFFER_SIZE'] *\n",
        "                                batch_size).repeat()\n",
        "\n",
        "    dataset = dataset.map(preprocess).batch(batch_size)\n",
        "    iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)\n",
        "    features, labels = iterator.get_next()\n",
        "    return features, labels\n",
        "\n",
        "  return _input_fn\n",
        "\n",
        "\n",
        "def generate_feature_columns(embed):\n",
        "  \"\"\"Creates the feature columns that we will train the model on.\n",
        "\n",
        "  Args:\n",
        "    embed: If true, we embed the columns.\n",
        "\n",
        "  Returns:\n",
        "    A list with the feature columns.\n",
        "  \"\"\"\n",
        "  feature_columns = []\n",
        "  for feature in MODEL_H_PARAMS['features']:\n",
        "    if feature in MODEL_P_PARAMS['vocab_size_by_feature']:\n",
        "      # vocab_size + 1 to handle missing features\n",
        "      num_buckets = MODEL_P_PARAMS['vocab_size_by_feature'][feature] + 1\n",
        "    else:\n",
        "      # All none-text features are booleans, so 2 buckets are enough\n",
        "      num_buckets = 2\n",
        "    column = tf.feature_column.categorical_column_with_identity(\n",
        "        key=feature, num_buckets=num_buckets)\n",
        "    if embed:\n",
        "      column = tf.feature_column.embedding_column(\n",
        "          column, dimension=math.ceil(math.log2(num_buckets)))\n",
        "    feature_columns.append(column)\n",
        "  return feature_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fUKF4a7R4FWG"
      },
      "source": [
        "# Build Model\n",
        "\n",
        "Set up an ensemble estimator combining a Linear estimator and a DNN"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "di00VeBbvXHa"
      },
      "outputs": [],
      "source": [
        "head = tf.estimator.BinaryClassHead()\n",
        "\n",
        "adam = lambda: tf.keras.optimizers.Adam(\n",
        "    learning_rate=MODEL_H_PARAMS['LEARNING_RATE'],\n",
        "    clipnorm=MODEL_H_PARAMS['CLIP_NORM'])\n",
        "\n",
        "estimator = adanet.AutoEnsembleEstimator(\n",
        "    head=head,\n",
        "    candidate_pool={\n",
        "        'linear':\n",
        "            tf.estimator.LinearEstimator(\n",
        "                head=head,\n",
        "                feature_columns=generate_feature_columns(False),\n",
        "                optimizer=adam),\n",
        "        'dnn':\n",
        "            tf.estimator.DNNEstimator(\n",
        "                head=head,\n",
        "                hidden_units=MODEL_H_PARAMS['HIDDEN_UNITS'],\n",
        "                feature_columns=generate_feature_columns(True),\n",
        "                optimizer=adam,\n",
        "                activation_fn=MODEL_H_PARAMS['ACTIVATION_FN'],\n",
        "                dropout=MODEL_H_PARAMS['DROPOUT'],\n",
        "                batch_norm=MODEL_H_PARAMS['DO_BATCH_NORM'])\n",
        "    },\n",
        "    max_iteration_steps=MODEL_H_PARAMS['MAX_ITERATION_STEPS'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5Fb7wW8o4qMm"
      },
      "source": [
        "# Train Model\n",
        "\n",
        "For demonstration purposes, we set *max_steps* to a small value so that the\n",
        "training finishes fast. This is enough to achieve good results. Alternatively, you can remove the *max_steps* argument and let the estimator train to convergence."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rX_sbCcZ4oZl"
      },
      "outputs": [],
      "source": [
        "result = tf.estimator.train_and_evaluate(\n",
        "    estimator,\n",
        "    train_spec=tf.estimator.TrainSpec(\n",
        "        input_fn=input_fn(\n",
        "            'train', training=True, batch_size=MODEL_H_PARAMS['BATCH_SIZE']),\n",
        "        max_steps=MODEL_H_PARAMS['MAX_TRAIN_STEPS']),\n",
        "    eval_spec=tf.estimator.EvalSpec(\n",
        "        input_fn=input_fn(\n",
        "            'eval', training=False, batch_size=MODEL_H_PARAMS['BATCH_SIZE']),\n",
        "        steps=None,\n",
        "        start_delay_secs=1,\n",
        "        throttle_secs=1,\n",
        "    ))[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CLif6fwgZeAJ"
      },
      "source": [
        "# Model perfomance on validation set"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tdK0LraQgrvH"
      },
      "outputs": [],
      "source": [
        "print('AUC:', result['auc'], 'AUC_PR:', result['auc_precision_recall'],\n",
        "      'Recall:', result['recall'], 'Precision:', result['precision'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2RuzrIl_ZxmH"
      },
      "source": [
        "# Model perfomance on testing set"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nG0avhs5x2s_"
      },
      "outputs": [],
      "source": [
        "ret = estimator.evaluate(\n",
        "    input_fn('test', training=False, batch_size=MODEL_H_PARAMS['BATCH_SIZE']))\n",
        "print('AUC:', ret['auc'], 'AUC_PR:', ret['auc_precision_recall'], 'Recall:',\n",
        "      ret['recall'], 'Precision:', ret['precision'])"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "dataset_or_not.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
